Here we present two datasets where curves do not have the same duration.

Dataset 1

Let us investigate two common approaches to time normalisation prior to GAMM modelling.

Approach 1: No time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(t1, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1850974  0.0006736  274.77   <2e-16 ***
## CategoryPEAK 0.0145556  0.0009525   15.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                         edf Ref.df    F p-value    
## s(t1):CategoryNO_PEAK 8.917  8.998 4563  <2e-16 ***
## s(t1):CategoryPEAK    8.933  8.999 3116  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.778   Deviance explained = 77.8%
## fREML = -25458  Scale est. = 0.0045003  n = 19900

Notice that:

  • The model does not complain if the curves have different durations
    • actually, it does not know there are curves!
  • Confidence bands get wider towards the end
  • Predicted curves are not capturing the salient second peak quite well (var explained 77%)

Approach 2: Linear time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(t_lin, by = Category)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.179668   0.000227  791.50   <2e-16 ***
## CategoryPEAK 0.024315   0.000321   75.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df     F p-value    
## s(t_lin):CategoryNO_PEAK 8.979      9 47803  <2e-16 ***
## s(t_lin):CategoryPEAK    8.998      9 36802  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.975   Deviance explained = 97.5%
## fREML = -47050  Scale est. = 0.00051256  n = 19900

In this case the result is excellent (var explained 97%). In fact, the data were generated by randomly applying a random linear time distorsion to each curve.

Dataset 2

Suppose the data look like this instead:

Let us apply the same approaches as in dataset 1.

Approach 1: No time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(time, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.2247908  0.0002404  934.88   <2e-16 ***
## CategoryPEAK 0.0190229  0.0003400   55.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df     F p-value    
## s(time):CategoryNO_PEAK 8.966      9 34965  <2e-16 ***
## s(time):CategoryPEAK    8.998      9 24985  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.972   Deviance explained = 97.2%
## fREML = -38123  Scale est. = 0.00045457  n = 15733

This is the perfect solution in this case. In fact, the data were generated by chopping the curves randmoly towards the end.

Approach 2: Linear time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(t_lin, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.2251063  0.0005743   392.0   <2e-16 ***
## CategoryPEAK 0.0182736  0.0008121    22.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df    F p-value    
## s(t_lin):CategoryNO_PEAK 8.746  8.981 5480  <2e-16 ***
## s(t_lin):CategoryPEAK    8.834  8.992 3602  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.839   Deviance explained = 83.9%
## fREML = -24456  Scale est. = 0.0025932  n = 15733

This time linear time normalisation creates distortions and artifacts in the solution.

Interim conclusions

Correlation between shape and duration

No categories

No time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2119276  0.0007156   296.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df    F p-value    
## s(time) 8.92  8.998 3364  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.753   Deviance explained = 75.3%
## fREML = -12112  Scale est. = 0.0050952  n = 9950

Linear time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(t_lin)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2119276  0.0004548   465.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df    F p-value    
## s(t_lin) 8.992      9 9956  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =    0.9   Deviance explained =   90%
## fREML = -16610  Scale est. = 0.0020585  n = 9950

Duration as covariate

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ te(t_lin, duration, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2119276  0.0004486   472.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                      edf Ref.df    F p-value    
## te(t_lin,duration) 71.79     82 1127  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.903   Deviance explained = 90.3%
## fREML = -16640  Scale est. = 0.0020025  n = 9950

Notice that:

  • The model captures somehow the correlation between duration and shape change
    • Though some parts of the time-by-duration space are not well modelled

With categories

No time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(time, by = Category)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.11650    0.01309   8.902   <2e-16 ***
## CategoryPEAK  0.19460    0.01310  14.856   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df     F p-value    
## s(time):CategoryNO_PEAK 7.151  7.376  6392  <2e-16 ***
## s(time):CategoryPEAK    8.975  9.000 10029  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.891   Deviance explained = 89.2%
## fREML = -30138  Scale est. = 0.0028116  n = 19900

Linear time normalisation

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(t_lin, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1415134  0.0003388   417.6   <2e-16 ***
## CategoryPEAK 0.1216671  0.0004792   253.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df     F p-value    
## s(t_lin):CategoryNO_PEAK 8.927  8.998 13318  <2e-16 ***
## s(t_lin):CategoryPEAK    8.997  9.000 27498  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.956   Deviance explained = 95.6%
## fREML = -39085  Scale est. = 0.0011421  n = 19900

Duration as covariate

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + te(t_lin, duration, by = Category, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.13807    0.03382   4.082 4.48e-05 ***
## CategoryPEAK  0.24412    0.03936   6.202 5.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                      edf Ref.df    F p-value    
## te(t_lin,duration):CategoryNO_PEAK 50.69  56.23 2131  <2e-16 ***
## te(t_lin,duration):CategoryPEAK    56.19  58.48 4218  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.956   Deviance explained = 95.6%
## fREML = -38897  Scale est. = 0.0011445  n = 19900

Notice that:

  • The two categories are estimated reliably only within their duration range
  • Out of their duration range, estimates are off and with large confidence bands

This is to be expected because there are no samples of long NO_PEAK curves and of shot PEAK curves.

Besides that, using duration as covariate in a categorical setting is conceptually wrong, as duration is part of how the curve looks, just like its shape, and the purpose of the model should be to predict how the curve looks given its category. Duration is not part of the experimental design, so it should be a response variable, not a predictor.

Alternative approaches to joint modelling of shape and duration

An approach based on multidimensional functional PCA allows to jointly analyse curves together with their time distortion representation. This is more general than applying linear time normalisation and it is based on landmark registration.

See this paper, appendix A in this paper, this paper and material available in this repo (under notes/ and projects/).